Abstract

Support Vector Machines (SVMs) and the Large Margin Nearest Neighbor algorithm (LMNN) are two very popular learning algorithms with quite different learning biases. In this paper we bring them into a unified view and show that they have a much stronger relation than is commonly thought. We analyze SVMs from a metric learning perspective and cast them as a metric learning problem, a view which helps us uncover the relations of the two algorithms. We show that LMNN can be seen as learning a set of local SVM-like models in a quadratic space. Along the way, and inspired by the metric-based interpretation of SVMs, we derive a novel variant of SVMs, e-SVM, to which LMNN is even more similar. We give a unified view of LMNN and the different SVM variants. Finally we provide some preliminary experiments on a number of benchmark datasets which show that e-SVM compares favorably with both LMNN and SVM.



1 Introduction 

Support Vector Machines [2] and metric learning algorithms [15, 8, 9, 16] are two very popular learning paradigms with quite distinct learning biases. In this paper we focus on SVMs and LMNN, one of the most prominent metric learning algorithms [15]; we bring them into a unified view and show that they have a much stronger relation than is commonly accepted.

We show that SVM can be formulated as a metric learning problem, a fact which provides new insights into it. Based on these insights, and employing learning biases typically used in metric learning, we propose e-SVM, a novel SVM variant which is shown to empirically outperform SVM. e-SVM is related to, but simpler than, the radius-margin ratio error bound optimization that is often used when SVM is coupled with feature selection and weighting [12, 5] and multiple kernel learning [1, 4, 7]. More importantly we demonstrate a very strong and previously unknown connection between LMNN and SVM. Until now LMNN has been considered as the distance-based counterpart of SVM, in the somewhat shallow sense that both use some concept of margin, even though their respective margin concepts are defined differently, and the hinge loss function, within a convex optimization problem. We show that the relation between LMNN and SVM is much deeper and demonstrate that LMNN can be seen as a set of local SVM-like classifiers in a quadratic space. In fact we show that LMNN is even more similar to e-SVM than to SVM. This strong connection has the potential to lead to more efficient LMNN implementations, especially for large scale problems, and, vice versa, to more efficient schemes for multiclass SVM. Moreover the result is also valid for other large margin metric learning algorithms. Finally we use the metric-based view to provide a unified view of SVM, e-SVM and LMNN.

Overall the main contribution of the paper is the uni- 
fied view of LMNN, SVM and the variants of the latter. 
Along the way we also devise a new algorithm, e-SVM, 
that combines ideas from both SVM and LMNN and 
finds its support in the SVM error bound. 

In the next subsection we briefly describe the basic concepts of SVM and LMNN. In Section 2 we describe SVM in the metric learning view and the new insights that this view brings. We also discuss some invariant properties of SVM and how metric learning may or may not help to improve SVM. Section 3 describes e-SVM, a metric-learning and SVM inspired algorithm. Section 4 discusses the relation of LMNN and SVM, and Section 5 provides a unified view of the different SVM and LMNN variants. In Section 6 we give some experimental results for e-SVM and compare it with SVM and LMNN. Finally we conclude in Section 7.

1.1 Basic SVM and LMNN concepts

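We consider a binary classification problem in which we are given a set of $n$ learning instances $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, $\mathbf{x}_i \in \mathbb{R}^d$, where $y_i$ is the class label of the $\mathbf{x}_i$ instance and $y_i \in \{+1, -1\}$. We denote by $\|\mathbf{x}\|_p$ the $l_p$ norm of the $\mathbf{x}$ vector. Let $H_{\mathbf{w}}$ be a hyperplane given by $\mathbf{w}^T\mathbf{x} + b = 0$; the signed distance, $d(\mathbf{x}_i, H_{\mathbf{w}})$, of some point $\mathbf{x}_i$ from the $H_{\mathbf{w}}$ hyperplane is given by: $d(\mathbf{x}_i, H_{\mathbf{w}}) = \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|_2}$. SVMs learn a hyperplane $H_{\mathbf{w}}$ which separates the two classes and has maximum margin $\gamma$, which is, informally, the distance of the nearest instances from $H_{\mathbf{w}}$: $\gamma = \min_i [y_i\, d(\mathbf{x}_i, H_{\mathbf{w}})]_+$ [2]. The optimization problem that SVMs solve is:

$$\max_{\mathbf{w}, b, \gamma} \gamma \quad \text{s.t.} \quad \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|_2} \geq \gamma, \; \forall i \tag{1}$$

which is usually rewritten to avoid the uniform scaling problem with $\mathbf{w}, b$, as:

$$\min_{\mathbf{w}, b} \|\mathbf{w}\|_2^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \; \forall i \tag{2}$$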

The margin maximization is motivated by the SVM error bound which is a function of the $R^2/\gamma^2$ ratio; $R$ is the radius of the smallest sphere that contains the learning instances. Standard SVMs only focus on the margin and ignore the radius because for a given feature space this is fixed. However the $R^2/\gamma^2$ ratio has been used as an optimization criterion in several cases, for feature selection, feature weighting and multiple kernel learning [1, 12, 4, 5, 7]. The biggest challenge when using the radius-margin bound is that it leads to non-convex optimization problems.

In metric learning we learn a Mahalanobis metric parametrized by a Positive Semi-Definite (PSD) matrix $M$ under some cost function and some constraints on the Mahalanobis distances of same and different class instances. The squared Mahalanobis distance has the following form: $d_M^2(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j)$. $M$ can be rewritten as $M = L^T L$, i.e. the Mahalanobis distance computation can be considered as a two-step procedure that first computes a linear transformation of the instances given by the matrix $L$ and then the Euclidean distance in the transformed space, i.e. $d_M^2(\mathbf{x}_i, \mathbf{x}_j) = d^2(L\mathbf{x}_i, L\mathbf{x}_j)$.
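
The two-step reading of the Mahalanobis distance can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `mahalanobis_sq`, the random matrix `L` and the dimension 3 are illustrative choices and not part of the paper.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))   # arbitrary linear transformation (illustrative)
M = L.T @ L                   # M = L^T L is PSD by construction
x_i, x_j = rng.normal(size=3), rng.normal(size=3)

# Distance under the metric M in the original space ...
d_metric = mahalanobis_sq(x_i, x_j, M)
# ... equals the squared Euclidean distance after mapping both points with L.
d_euclid = float(np.sum((L @ x_i - L @ x_j) ** 2))
```

Both quantities agree up to floating point error, which is why learning $M$ and learning $L$ are used interchangeably in the text.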

LMNN, one of the most popular metric learning methods, also works based on the concept of margin, which is nevertheless different from that of SVM. While the SVM margin is defined globally with respect to a hyperplane, the LMNN margin is defined locally with respect to center points. The LMNN margin, $\gamma_{\mathbf{x}_0}$, of a center point, instance $\mathbf{x}_0$, is given by:

$$\gamma_{\mathbf{x}_0} = \min_{y_0 \neq y_j,\; \mathbf{x}_i \in targ(\mathbf{x}_0)} \left[ d_M^2(\mathbf{x}_0, \mathbf{x}_j) - d_M^2(\mathbf{x}_0, \mathbf{x}_i) \right]_+ \tag{3}$$

where $targ(\mathbf{x}_0)$ is the set of the target neighbors of $\mathbf{x}_0$, which is defined as $targ(\mathbf{x}_0) = \{\mathbf{x}_i \mid \mathbf{x}_i \in \text{neighborhood of } \mathbf{x}_0 \wedge y_0 = y_i\}$. The neighborhood can be defined either as the $k$ nearest neighbors or as a sphere of some radius.
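
The local margin of equation (3) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: target neighbors are taken to be the `k` nearest same-class points under the Euclidean distance, and the function name `lmnn_margin` and the toy data are hypothetical.

```python
import numpy as np

def lmnn_margin(X, y, center, M, k=1):
    """Local margin of eq. (3) for the center point X[center].

    Assumption: targets are the k nearest same-class neighbours of the
    center under the plain Euclidean distance.
    """
    diffs = X - X[center]
    d_M = np.einsum('nd,de,ne->n', diffs, M, diffs)  # squared Mahalanobis distances
    d_E = np.sum(diffs ** 2, axis=1)                 # squared Euclidean distances
    idx = np.arange(len(y))
    same = idx[(y == y[center]) & (idx != center)]
    other = idx[y != y[center]]
    targets = same[np.argsort(d_E[same])[:k]]
    # min over (impostor j, target i) of d(x0, xj) - d(x0, xi)
    gap = d_M[other].min() - d_M[targets].max()
    return float(max(gap, 0.0))

X = np.array([[0., 0.], [1., 0.], [3., 0.], [0., 4.]])
y = np.array([1, 1, -1, -1])
margin_identity = lmnn_margin(X, y, center=0, M=np.eye(2))
margin_scaled = lmnn_margin(X, y, center=0, M=np.diag([4.0, 1.0]))
```

On this toy data the margin of the first point changes when the metric changes, which is the quantity a large margin metric learner tries to increase.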

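LMNN maximizes the sum of the margins of all instances. The underlying idea is that it learns a metric $M$, or a linear transformation $L$, which brings instances close to their same class center point while it moves different class instances away from it with a margin of one. This learning bias is commonly referred to as large margin metric learning [13, 15, 14, 10]. The LMNN optimization problem is [15]:

$$\min_{M, \xi} \sum_{i} \sum_{j: \mathbf{x}_j \in targ(\mathbf{x}_i)} \Big( d_M^2(\mathbf{x}_i, \mathbf{x}_j) + C \sum_l (1 - y_i y_l)\, \xi_{ijl} \Big) \tag{4}$$

$$\text{s.t.} \quad d_M^2(\mathbf{x}_i, \mathbf{x}_l) - d_M^2(\mathbf{x}_i, \mathbf{x}_j) \geq 1 - \xi_{ijl}, \; \forall (j \mid \mathbf{x}_j \in targ(\mathbf{x}_i)), \forall i, l; \quad \xi \geq 0, \; M \succeq 0$$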



2 SVM under a metric learning view 

We can formulate the SVM learning problem as a met- 
ric learning problem in which the learned transforma- 
tion matrix is diagonal. To do so we will first define 
a new metric learning problem and then we will show 
its equivalence to SVM. 

We start by introducing the fixed hyperplane $H_1$, the normal vector of which is $\mathbf{1}$ (a $d$-dimensional vector of ones), i.e. $H_1: \mathbf{1}^T\mathbf{x} + b = 0$. Consider the following linear transformation $\tilde{\mathbf{x}} = W\mathbf{x}$, where $W$ is a $d \times d$ diagonal matrix with $diag(W) = \mathbf{w} = (w_1, \ldots, w_d)^T$.

We now define the margin $\gamma$ of two classes with respect to the hyperplane $H_1$ as the minimum absolute difference of the distances of any two different-class instances, $\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j$, projected to the normal vector of the $H_1$ hyperplane, which can be finally written as: $\gamma = \min_{i,j,\, y_i \neq y_j} \frac{|\mathbf{1}^T W(\mathbf{x}_i - \mathbf{x}_j)|}{\sqrt{d}} = \min_{i,j,\, y_i \neq y_j} |d(\tilde{\mathbf{x}}_i, H_1) - d(\tilde{\mathbf{x}}_j, H_1)|$. In fact this is the SVM margin, see equation (17) in the Appendix.

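We are interested in that linear transformation, $diag(W) = \mathbf{w}$, for which it holds $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 0$, i.e. the instances of the two classes lie on two different sides of the hyperplane $H_1$, and which has a maximum margin. This is achieved by the following metric-learning optimization problem:

$$\max_{\mathbf{w}} \min_{i,j,\, y_i \neq y_j} (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{w}\mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j) \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 0, \; \forall i \tag{5}$$

$$\Leftrightarrow \quad \max_{\mathbf{w}, \gamma} \gamma \quad \text{s.t.} \quad (\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j))^2 \geq \gamma^2, \; \forall (i, j, y_i \neq y_j); \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 0, \; \forall i \tag{6}$$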

Additionally we prefer the $W$ transformation that places the two classes "symmetrically" with respect to the $H_1$ hyperplane, i.e. the separating hyperplane is at the middle of the margin instances of the two classes: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq \gamma/2, \forall i$. Notice that this can always be achieved by adjusting the translation parameter $b$ by adding to it the $(\gamma_1 - \gamma_2)/2$ value, where $\gamma_1$ and $\gamma_2$ are respectively the margins of the $y_1$ and $y_2$ classes to the $H_1$ hyperplane. Hence the parameter $b$ of the $H_1$ hyperplane can be removed and replaced by an equivalent translation transformation $b$. From now on, we assume $H_1: \mathbf{1}^T\mathbf{x} + 0 = 0$ and we add a translation $b$ after the linear transformation $W$. With this 'symmetry' preference, (6) can be reformulated as (see the detailed proof in the Appendix):

$$\max_{\mathbf{w}, b, \gamma} \gamma \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq \gamma, \; \forall i \tag{7}$$

This optimization problem learns a diagonal transformation $W$ and a translation $b$ which maximize the margin and place the classes 'symmetrically' to the $H_1: \mathbf{1}^T\mathbf{x} = 0$ hyperplane. However this optimization problem scales with $\mathbf{w}$ (as is the case with the SVM formulation (1)). Therefore we need a way to control the uniform scaling of $\mathbf{w}$. One way is to fix some norm of $\mathbf{w}$, e.g. $\|\mathbf{w}\|_p = 1$, and the optimization problem becomes:

$$\max_{\mathbf{w}, b, \gamma} \gamma \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq \gamma, \; \forall i, \quad \|\mathbf{w}\|_p = 1 \tag{8}$$

Notice that both problems (7) and (8) are still differ- 
ent from the standard SVM formulation given in (1) 
or (2). However it can be shown that they are in fact 
equivalent to SVM; for the detailed proof see the
Appendix.

Thus we see that starting from a Mahalanobis metric learning problem (5), i.e. learning a linear transformation $W$, and with the appropriate cost function and constraints on pairwise distances, we arrive at a standard SVM learning problem. We can describe SVM in the metric learning jargon as follows: it learns a diagonal linear transformation $W$ and a translation $b$ which maximize the margin and place the two classes symmetrically on the opposite sides of the hyperplane $H_1: \mathbf{1}^T\mathbf{x} = 0$. In the standard view of SVM, the space is fixed and the hyperplane is moved around to achieve the optimal margin. In the metric view of SVM, the hyperplane is fixed to $H_1$ and the space is scaled, $W$, and then translated, $b$, so that the instances are placed optimally around $H_1$. Introducing a fixed hyperplane $H_1$ will provide various advantages in relation to the different radius-margin SVM versions and LMNN as we will show soon. It is also worth noting that the SVM metric view holds true for any kernel space, since its final optimization problem is the same as that of standard SVM, i.e. we can kernelize it directly as SVM.

1 Notice that $(\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j))^2 = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{w}\mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j)$ is the squared Mahalanobis distance associated with the rank 1 matrix $M = \mathbf{w}\mathbf{w}^T$.

From the metric learning perspective, one may think that learning a full linear transformation instead of a diagonal one could bring more advantage; however, as we will see right away, this is not true for the case of SVM. For any linear transformation $L$ (full matrix), the distance of $L\mathbf{x}$ to the hyperplane $H_1: \mathbf{1}^T\mathbf{x} + b = 0$ is: $d(L\mathbf{x}, H_1) = \frac{\mathbf{1}^T L\mathbf{x} + b}{\sqrt{d}} = \frac{\mathbf{1}^T D_L \mathbf{x} + b}{\sqrt{d}}$, where $D_L$ is a diagonal matrix in which the $k$th diagonal element corresponds to the sum of the elements of the $k$th column of $L$: $D_{L,kk} = \sum_i L_{ik}$. So for any full transformation matrix $L$ there exists a diagonal transformation $D_L$ which has the same signed distance to the hyperplane. This is also true for any hyperplane $\mathbf{w}^T\mathbf{x} + b$ where $\mathbf{w}$ does not contain any zero elements. Thus learning a full matrix does not bring any additional advantage to SVM.
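
The column-sum argument can be verified directly. The sketch below assumes NumPy; the helper `signed_distance`, the random matrix and the offset `b = 0.3` are illustrative values rather than anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
L = rng.normal(size=(d, d))        # full linear transformation (illustrative)
D_L = np.diag(L.sum(axis=0))       # k-th diagonal entry = sum of the k-th column of L
x = rng.normal(size=d)
b = 0.3
ones = np.ones(d)

def signed_distance(z, b):
    """Signed distance of z from the fixed hyperplane 1^T z + b = 0."""
    return (ones @ z + b) / np.linalg.norm(ones)

dist_full = signed_distance(L @ x, b)     # after the full transformation
dist_diag = signed_distance(D_L @ x, b)   # after the diagonal surrogate
```

The two distances coincide because $\mathbf{1}^T L$ is exactly the vector of column sums of $L$, so only that vector matters for the distance to the fixed hyperplane.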



3 e-SVM, an alternative to the 
radius-margin approach 

A common bias in metric learning is to learn a metric which keeps instances of the same class close while pushing instances of different classes far away []. This bias is often implemented through local or global constraints on the pairwise distances of same-class and different-class instances which make the between class distance large and the within class distance small. Under the metric view of SVM we see that the learned linear transformation does control the between class distance by maximizing the margin, but it ignores the within class distance. Inspired by the metric learning biases we will extend the SVM under the metric view and incorporate constraints on the within-class distance which we will minimize.

Figure 1: Within class distance: controlling the radius $R$ or the instance-hyperplane distances $\eta_i$.

We can quantify the within class distance in a number of different ways, such as the sum of instance distances from the centers of their class, as in Fisher Discriminant Analysis (FDA) [11], or the pairwise distances of the instances of each class, as is done in several metric learning algorithms [16, 3, 15]. Yet another way to indirectly minimize the within class distance is by minimizing the total data spread while maximizing the between class distance. One measure of the total data spread is the radius, $R$, of the sphere that contains the data; so by minimizing the radius-margin ratio, $R^2/\gamma^2$, we can naturally minimize the within class distance while maximizing the between class distance. Interestingly the SVM theoretical error bound points towards the minimization of the same quantity, i.e. the radius-margin ratio, thus the proposed metric learning bias finds its theoretical support in the SVM error bound. However optimizing over the margin and the radius poses several difficulties since the radius is computed in a complex form [12, 4, 7].

We can avoid the problems that are caused by the radius by using a different quantity to indirectly control the within class distance. Instead of the radius, we propose to minimize the sum of the instance distances from the margin hyperplane; this sum is yet another way to control the data spread. We will call the resulting algorithm, which in addition to the margin maximization also minimizes the within-class distance through the sum of the instance distances from the margin hyperplane, e-SVM. The minimization of the sum has a similar effect to the minimization of the SVM radius, Figure 1. In a later section we will show that e-SVM can be seen as a bridge between LMNN and SVM.

We define the optimization problem of e-SVM as fol- 
lows: select the transformations, one linear diagonal, 
diag(W) = w, and one translation, b, which maximize 
the margin and keep the instances symmetrically and 
within a small distance from the $H_1$ hyperplane. This
is achieved by the following optimization problem: 

$$\min_{\mathbf{w}, b} \; \mathbf{w}^T\mathbf{w} + \lambda \sum_i \max(0, y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1) + C \sum_i \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) \tag{9}$$

$\max(0, y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1)$ penalizes instances which lie on the correct side of the hyperplane but are far from the margin. $\max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ is the SVM hinge loss which penalizes instances violating the margin. (9) is equivalent to:

$$\min_{\mathbf{w}, b, \xi, \eta} \; \mathbf{w}^T\mathbf{w} + \lambda \sum_i \eta_i + C \sum_i \xi_i \tag{10}$$

$$\text{s.t.} \quad 1 - \xi_i \leq y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 1 + \eta_i, \; \forall i; \quad \xi, \eta \geq 0$$

where $\xi_i$ is the standard SVM slack variable for instance $i$ and $\eta_i$ the distance of that instance from the margin hyperplane. The dual form of this optimization problem is:

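$$\max_{\alpha, \beta} \; -\frac{1}{4} \sum_{i,j} (\alpha_i - \beta_i)(\alpha_j - \beta_j)\, y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j + \sum_i (\alpha_i - \beta_i) \tag{11}$$

$$\text{s.t.} \quad \sum_i (\alpha_i - \beta_i)\, y_i = 0; \quad 0 \leq \alpha_i \leq C; \quad 0 \leq \beta_i \leq \lambda, \; \forall i$$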
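
The dual can be checked with a short Lagrangian calculation starting from (10). The following is a sketch rather than the authors' derivation; the multipliers $\mu_i, \nu_i$ for the constraints $\xi_i \geq 0$ and $\eta_i \geq 0$ are notation introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L} &= \mathbf{w}^T\mathbf{w} + \lambda\sum_i \eta_i + C\sum_i \xi_i
  - \sum_i \alpha_i\,[\,y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\,] \\
  &\quad - \sum_i \beta_i\,[\,1 + \eta_i - y_i(\mathbf{w}^T\mathbf{x}_i + b)\,]
  - \sum_i \mu_i \xi_i - \sum_i \nu_i \eta_i \\
\partial_{\mathbf{w}}\mathcal{L} = 0 &\;\Rightarrow\;
  \mathbf{w} = \tfrac{1}{2}\sum_i (\alpha_i - \beta_i)\, y_i\, \mathbf{x}_i \\
\partial_b \mathcal{L} = 0 &\;\Rightarrow\; \sum_i (\alpha_i - \beta_i)\, y_i = 0 \\
\partial_{\xi_i}\mathcal{L} = 0 &\;\Rightarrow\; \alpha_i = C - \mu_i \le C, \qquad
\partial_{\eta_i}\mathcal{L} = 0 \;\Rightarrow\; \beta_i = \lambda - \nu_i \le \lambda \\
\mathcal{L}^{*} &= -\tfrac{1}{4}\sum_{i,j} (\alpha_i-\beta_i)(\alpha_j-\beta_j)\,
  y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j + \sum_i (\alpha_i - \beta_i)
\end{aligned}
```

The factor $1/4$ follows from the objective using $\mathbf{w}^T\mathbf{w}$ rather than $\frac{1}{2}\mathbf{w}^T\mathbf{w}$; under the latter convention the quadratic term would carry $1/2$ instead.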

Note that we have two hyper-parameters, $C$ and $\lambda$. Typically we will assign a higher value to $C$ since we tolerate violations of the margin less than a larger data spread. The two parameters control the trade-off between the importance of the margin and the within data spread.
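
To make the roles of the two penalties concrete, the objective (9) can be evaluated directly. This is a minimal sketch assuming NumPy; the function name `esvm_objective`, the toy data and the values `lam=0.1`, `C=10` are illustrative assumptions, and no optimizer is included.

```python
import numpy as np

def esvm_objective(w, b, X, y, lam, C):
    """Value of objective (9), plus the per-instance eta and xi of (10)."""
    margins = y * (X @ w + b)
    eta = np.maximum(0.0, margins - 1.0)  # how far an instance lies beyond the margin hyperplane
    xi = np.maximum(0.0, 1.0 - margins)   # hinge slack for margin violations
    value = float(w @ w + lam * eta.sum() + C * xi.sum())
    return value, eta, xi

# Toy one-dimensional layout embedded in 2-D (illustrative values).
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])

value, eta, xi = esvm_objective(w, b=0.0, X=X, y=y, lam=0.1, C=10.0)
```

Here the first and last points are correctly classified but far from their margin hyperplane, so they contribute through `eta`, while the second point violates the margin and contributes through `xi`; at most one of the two terms is non-zero for any instance.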

4 On the relation of LMNN and SVM 

In this section we demonstrate the relation between LMNN and SVM. Let us define a quadratic mapping $\Phi$ that maps $\mathbf{x}$ to a quadratic space where $\Phi(\mathbf{x}) = (x_1^2, x_2^2, \ldots, x_d^2, \{x_i x_j\}_{d \geq i > j \geq 1}, x_1, x_2, \ldots, x_d) \in \mathbb{R}^{d'}$, $d' = d(d+3)/2$. We will show that the LMNN margin in the original space is the SVM margin in this quadratic space.

Concretely, the squared distance of an instance $\mathbf{x}$ from a center point instance $\mathbf{x}_l$ in the linearly transformed space that corresponds to the Mahalanobis distance $M$ learned by LMNN can be expressed in a linear form of $\Phi(\mathbf{x})$. Let $L$ be the linear transformation associated with the learned Mahalanobis distance $M$. We have:

$$d_M^2(\mathbf{x}, \mathbf{x}_l) = d^2(L\mathbf{x}, L\mathbf{x}_l) = \mathbf{x}^T L^T L \mathbf{x} - 2\mathbf{x}_l^T L^T L \mathbf{x} + \mathbf{x}_l^T L^T L \mathbf{x}_l = \mathbf{w}_l^T \Phi(\mathbf{x}) + b_l \tag{12}$$

where $b_l = \mathbf{x}_l^T L^T L \mathbf{x}_l$ and $\mathbf{w}_l$ has two parts, quadratic $\mathbf{w}^{quad}$ and linear $\mathbf{w}_l^{lin}$: $\mathbf{w}_l = (\mathbf{w}^{quad}, \mathbf{w}_l^{lin})$, where $\mathbf{w}^{quad}$ is the vector of the coefficients of the quadratic components of $\Phi(\mathbf{x})$, with $d' - d$ elements given by the elements of $L^T L$, and $\mathbf{w}_l^{lin}$ is equal to $-2L^T L \mathbf{x}_l$, the vector of coefficients of the linear components of $\Phi(\mathbf{x})$. $d^2(L\mathbf{x}, L\mathbf{x}_l)$ is proportional to the distance of $\Phi(\mathbf{x})$ from the hyperplane $H_l: \mathbf{w}_l^T \Phi(\mathbf{x}) + b_l = 0$ since $d^2(L\mathbf{x}, L\mathbf{x}_l) = \|\mathbf{w}_l\|\, d(\Phi(\mathbf{x}), H_l)$. Notice that this value is always non-negative, so in the quadratic space $\Phi(\mathbf{x})$ always lies on the positive side of $H_l$.
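
Identity (12) can be checked numerically. The sketch below assumes NumPy; `phi` and `local_hyperplane` are hypothetical helper names. One detail is made explicit here as an assumption of the sketch: because $L^T L$ is symmetric, each cross term $x_i x_j$ with $i > j$ receives the coefficient $2(L^T L)_{ij}$.

```python
import numpy as np

def phi(x):
    """Quadratic map: squares, cross terms x_i * x_j for i > j, then linear terms."""
    d = len(x)
    cross = np.array([x[i] * x[j] for i in range(d) for j in range(i)])
    return np.concatenate([x ** 2, cross, x])

def local_hyperplane(L, x_l):
    """Coefficients (w_l, b_l) with d^2(Lx, Lx_l) = w_l^T phi(x) + b_l."""
    M = L.T @ L
    d = len(x_l)
    w_sq = np.diag(M)
    # Factor 2 collects the symmetric entries M_ij and M_ji of each cross term.
    w_cross = np.array([2.0 * M[i, j] for i in range(d) for j in range(i)])
    w_lin = -2.0 * M @ x_l               # instance-specific linear part
    b_l = float(x_l @ M @ x_l)
    return np.concatenate([w_sq, w_cross, w_lin]), b_l

rng = np.random.default_rng(2)
d = 3
L = rng.normal(size=(d, d))
x, x_l = rng.normal(size=d), rng.normal(size=d)

w_l, b_l = local_hyperplane(L, x_l)
linear_form = float(w_l @ phi(x) + b_l)
squared_dist = float(np.sum((L @ x - L @ x_l) ** 2))
```

Only `w_lin` and `b_l` depend on the center point; the quadratic block is shared by all center points, which is the coupling between the local models discussed later in this section.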

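A more general formulation of the LMNN optimization problem (4) which reflects the same learning bias of LMNN, i.e. for each center point it keeps the same-class instances close to the center and pushes different-class instances outside the margin, is:

$$\min_{M, \xi, \gamma_{\mathbf{x}_l}, \forall l} \; \sum_l \frac{1}{\gamma_{\mathbf{x}_l}} + \lambda \sum_l \sum_{\mathbf{x}_i \in targ(\mathbf{x}_l)} \Big( d_M^2(\mathbf{x}_i, \mathbf{x}_l) + C \sum_j (1 - y_j y_l)\, \xi_{lij} \Big) \tag{13}$$

$$\text{s.t.} \quad d_M^2(\mathbf{x}_j, \mathbf{x}_l) - d_M^2(\mathbf{x}_i, \mathbf{x}_l) \geq \gamma_{\mathbf{x}_l} - \xi_{lij}, \; \forall (i \mid \mathbf{x}_i \in targ(\mathbf{x}_l)), \forall j, l; \quad \gamma_{\mathbf{x}_l} \geq 0, \forall l; \quad \xi \geq 0, \; M \succeq 0$$

Standard LMNN puts one extra constraint on each $\gamma_{\mathbf{x}_l}$: it sets each of them to one. With this constraint problem (13) reduces to (4). We can rewrite problem (13) as: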



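$$\min_{L, \xi, R_l, \gamma_{\mathbf{x}_l}, \forall l} \; \sum_l \frac{1}{\gamma_{\mathbf{x}_l}} + C \sum_l \sum_{\mathbf{x}_i \in B(\mathbf{x}_l)} \xi_{li} + \lambda \sum_l \sum_{\mathbf{x}_i \in targ(\mathbf{x}_l)} d^2(L\mathbf{x}_i, L\mathbf{x}_l) \tag{14}$$

$$\text{s.t.} \quad d^2(L\mathbf{x}_i, L\mathbf{x}_l) \leq R_l^2 + \xi_{li}, \; \forall \mathbf{x}_i \in targ(\mathbf{x}_l), \forall \mathbf{x}_l$$

$$d^2(L\mathbf{x}_j, L\mathbf{x}_l) \geq R_l^2 + \gamma_{\mathbf{x}_l} - \xi_{lj}, \; \forall (\mathbf{x}_j \mid y_j \neq y_l), \forall \mathbf{x}_l$$

$$\gamma_{\mathbf{x}_l} \geq 0, \forall l; \quad \xi \geq 0$$

In this formulation, visualized in fig. 2(a), the target neighbors (marked with ×) of the $\mathbf{x}_l$ center point are constrained within the $C_l$ circle with radius $R_l$ and center $L\mathbf{x}_l$, while the instances that have a label which is different from that of $\mathbf{x}_l$ (marked with □) are placed outside the circle $C''_l$ with center also $L\mathbf{x}_l$ and radius $R''_l = \sqrt{R_l^2 + \gamma_{\mathbf{x}_l}}$. We denote by $\theta_l$ the difference $R''_l - R_l$ of the radii of the two circles.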

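If we replace all the $d^2(L\mathbf{x}, L\mathbf{z})$ terms in problem (14) with their linear equivalent in the quadratic space, equation (12), and break up the optimization problem into a series of $n$ optimization problems, one for each instance $\mathbf{x}_l$, then we get for each $\mathbf{x}_l$ the following optimization problem:

$$\min_{\mathbf{w}_l, b'_l, \xi_l, \gamma_{\mathbf{x}_l}} \; \frac{1}{\gamma_{\mathbf{x}_l}} + C \sum_{\mathbf{x}_i \in B(\mathbf{x}_l)} \xi_{li} + \lambda \sum_{\mathbf{x}_i \in targ(\mathbf{x}_l)} \mathbf{w}_l^T(\Phi(\mathbf{x}_i) - \Phi(\mathbf{x}_l)) \tag{15}$$

$$\text{s.t.} \quad \mathbf{w}_l^T\Phi(\mathbf{x}_i) + b'_l \leq -\gamma_{\mathbf{x}_l}/2 + \xi_{li}, \; \forall \mathbf{x}_i \in targ(\mathbf{x}_l)$$

$$\mathbf{w}_l^T\Phi(\mathbf{x}_j) + b'_l \geq \gamma_{\mathbf{x}_l}/2 - \xi_{lj}, \; \forall \mathbf{x}_j, y_j \neq y_l$$

$$\mathbf{w}_l^T(\Phi(\mathbf{x}_i) - \Phi(\mathbf{x}_l)) \geq 0, \; \forall \mathbf{x}_i \in B(\mathbf{x}_l), \quad \gamma_{\mathbf{x}_l} \geq 0, \; \xi_l \geq 0$$

where $\mathbf{w}_l$ is not an independent variable but its quadratic and linear components are related as described above (formula (12)). The instance set $B(\mathbf{x}_l) = \{\mathbf{x}_i \in targ(\mathbf{x}_l)\} \cup \{\mathbf{x}_i \mid y_i \neq y_l\}$ is the set of all target neighbors and different class instances. In the formula (15) we replace $b_l$ by $-\mathbf{w}_l^T\Phi(\mathbf{x}_l)$ since $\mathbf{w}_l^T\Phi(\mathbf{x}_l) + b_l = 0$, and $b'_l = b_l - (R_l^2 + \gamma_{\mathbf{x}_l}/2)$. We denote the hyperplane $\mathbf{w}_l^T\Phi(\mathbf{x}) + b'_l = 0$ by $H'_l$ as in fig. 3.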

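Now let $y_{li} := y_l y_i$, so $y_{li} = 1, \forall \mathbf{x}_i \in targ(\mathbf{x}_l)$ and $y_{li} = -1, \forall (\mathbf{x}_i \mid y_i \neq y_l)$. Therefore $-y_{li}(\mathbf{w}_l^T\Phi(\mathbf{x}_i) + b'_l) \geq \gamma_{\mathbf{x}_l}/2 - \xi_{li}, \forall \mathbf{x}_i \in B(\mathbf{x}_l)$. Without loss of generality we can assume that $y_l = -1$ and problem (15) becomes:

$$\min_{\mathbf{w}_l, b'_l, \xi_l, \gamma_{\mathbf{x}_l}} \; \frac{1}{\gamma_{\mathbf{x}_l}} + C \sum_{\mathbf{x}_i \in B(\mathbf{x}_l)} \xi_{li} + \lambda \sum_{\mathbf{x}_i \in targ(\mathbf{x}_l)} \mathbf{w}_l^T(\Phi(\mathbf{x}_i) - \Phi(\mathbf{x}_l)) \tag{16}$$

$$\text{s.t.} \quad y_i(\mathbf{w}_l^T\Phi(\mathbf{x}_i) + b'_l) \geq \gamma_{\mathbf{x}_l}/2 - \xi_{li}, \; \forall \mathbf{x}_i \in B(\mathbf{x}_l)$$

$$\mathbf{w}_l^T(\Phi(\mathbf{x}_i) - \Phi(\mathbf{x}_l)) \geq 0, \; \forall \mathbf{x}_i \in B(\mathbf{x}_l), \quad \gamma_{\mathbf{x}_l} \geq 0, \; \xi_l \geq 0$$

where $\mathbf{w}_l$ is still constrained as in (15). It is worth noting that the second constraint of (16) will ensure that $d_M^2(\mathbf{x}_i, \mathbf{x}_l)$ is bigger than or equal to 0, i.e. it will almost ensure the PSD of the matrix $M$ (see footnote 2). However due to the specific structure of $\mathbf{w}_l$, i.e. quadratic and linear part, the PSD constraint of $M$ is guaranteed for any $\mathbf{x}$.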

We can think of problem (16) as an extension of an SVM optimization problem. Its training set is the $B(\mathbf{x}_l)$ set, i.e. the target neighbors of $\mathbf{x}_l$ and all instances with a different class label from $\mathbf{x}_l$. Its cost function includes, in addition to the term that maximizes the margin, also a sum term which forces the target neighbors of $\mathbf{x}_l$ to have small $\mathbf{w}_l^T\Phi(\mathbf{x}_i) + b_l$ values, i.e. to be at a small distance from the $H_l$ hyperplane.

Minimizing the target neighbor distances from the $H_l$ hyperplane makes the distance between support vectors and $H_l$ small. It therefore has the effect of bringing the negative margin hyperplane of $H'_l$ close to $H_l$, thus also bringing the target neighbors close to the negative margin hyperplane. In other words the optimization problem favors a small width for the band that is defined by $H_l$ and the negative margin hyperplane described above, which contains the target neighbors.

There is an even closer relation of LMNN with a local e-SVM applied in the quadratic space, which we will describe by comparing the optimization problems given in (9) and (16). e-SVM has almost the same learning bias as LMNN: the former maximizes the margin and brings all instances close to the margin hyperplanes, the latter also maximizes the margin but only brings the target neighbors close to the margin hyperplane. For each center point $\mathbf{x}_l$ the LMNN formulation, problem (16), is very similar to that of e-SVM, problem (9), applied in the quadratic space. The second term of the cost function of e-SVM together with its constraints force all instances to be close to their respective margin hyperplane; this can be seen more clearly in the minimization of $\sum_i \eta_i$ in problem (10). In LMNN, problem (16), the third term of the cost function plays a similar role by forcing only the target neighbors to be close to the $H_l$ hyperplane. This in turn has the effect, as mentioned above, of reducing the width of the band containing them and bringing them closer to their margin hyperplane; unlike e-SVM no such constraint is imposed on the remaining instances. So LMNN is equivalent to $n$ local, modified, e-SVMs, one for each instance. e-SVM controls the within class distance using the distance to the margin hyperplane and LMNN using the distance to the $H_l$ hyperplane.

2 Almost in the sense that the constraint ensures that $(\mathbf{x} - \mathbf{x}_l)^T M (\mathbf{x} - \mathbf{x}_l) \geq 0$ for all $\mathbf{x}, \mathbf{x}_l$ in the training dataset, but cannot guarantee the same for a new instance $\mathbf{x}$; to ensure $M$ is PSD, we need $\mathbf{z}^T M \mathbf{z} \geq 0$ for all $\mathbf{z}$.

Note that the $n$ optimization problems are not independent. The different solutions $\mathbf{w}_l$ have a common component which corresponds to the $d(d+1)/2$ quadratic features and is given by the $\mathbf{w}^{quad}$ vector. The linear component of $\mathbf{w}_l$, $\mathbf{w}_l^{lin}$, is a function of the specific $\mathbf{x}_l$, as described above. Overall, learning a transformation $L$ with LMNN, eq. (4), is equivalent to learning a local SVM-like model, given by $H'_l$, for each $\mathbf{x}_l$ center point in the quadratic space according to problems (15, 16). Remember that the $\mathbf{w}_l, b_l$ of the different center points are related, i.e. their common quadratic part is $\mathbf{w}^{quad}$. If we constrain $\mathbf{w}_l, b_l$ to be the same for all $\mathbf{x}_l$ and drop the PSD constraints on $\mathbf{w}_l$ then we get the (global) e-SVM-like solution.

Visualization: In fig. 2(b) we give a visualization of problem (15) in the quadratic space $\Phi(\mathbf{x})$. Figure 2(b) gives the equivalent linear perspective in the quadratic space of the LMNN model in the original space given in fig. 2(a). The center $L\mathbf{x}_l$ of the $C_l$ circle in fig. 2(a) corresponds to the $H_l$ hyperplane in fig. 2(b); the $C'_l$ circle with center $L\mathbf{x}_l$ and radius $R'_l = R_l + \theta_l/2$ corresponds to the $H'_l$ separating hyperplane in fig. 2(b). Figure 3(a) illustrates the different local linear models in the quadratic space. We can combine these different local models by employing the metric learning view of SVM and make the relation of LMNN and SVM even clearer. Instead of having many local SVM-like hyperplanes we bring each point $\Phi(\mathbf{x}_l)$ around the $H_1: \mathbf{1}^T\mathbf{x} + 0 = 0$ hyperplane by applying to it first a diagonal transformation $W_l$, $diag(W_l) = \mathbf{w}_l$, and then a translation (fig. 3(b)). As before with $\mathbf{w}_l$, the different $W_l$ transformations have a common component, which corresponds to the first $d(d+1)/2$ elements of the diagonal associated with the quadratic features, given by the $\mathbf{w}^{quad}$ vector, and an instance dependent component that corresponds to the linear features, which is given by $\mathbf{w}_l^{lin} = -2L^T L\mathbf{x}_l$; thus the translation transformation is also a function of the specific point $\mathbf{x}_l$. Notice that the common component has many more elements than the instance specific component. There is an analogy to multitask learning, where models learned over different tasks (datasets) are forced to have parts of their models the same.

Figure 2: Alternative views on LMNN

Figure 3: On the relation of LMNN and SVM

Prediction phase: Typically after learning a metric through LMNN a k-NN algorithm is used to predict the class of a new instance. Under the interpretation of LMNN as a set of local e-SVM linear classifiers this is equivalent to choosing the $k$ local hyperplanes $H'_l: \mathbf{w}_l^T\Phi(\mathbf{x}) + b'_l = 0$ which are farthest away from the new instance and which leave both the new instance and their respective center points on the same side. The farther away the new instance is from an $H'_l$ local hyperplane, the closer it is to the center point associated with $H'_l$. In effect this means that we simply choose those hyperplanes (classifiers) which are most confident on the label prediction of the new instance, and then we have them vote. In a sense this is similar to an ensemble learning scheme in which, each time we want to classify a new instance, we select those classifiers that are most confident and have them vote in order to determine the final prediction.

5 A unified view of LMNN, SVM and 
its variants 

Figure 4: Relation among local SVM, local e-SVM, global SVM, global e-SVM and LMNN in the quadratic space.

The standard SVM learning problem only optimizes the margin. On the other hand, LMNN, as well as the different variants of radius-margin based SVM under the metric learning view (briefly mentioned in Section 3), and e-SVM, have additional regularizers that control directly or indirectly the within class distance. LMNN is much closer to a collection of local e-SVMs than to local SVMs, since both LMNN and e-SVM have that additional regularizer of the within class distance. We briefly summarize the different methods that we discussed:

• In standard SVM the space is fixed and the separat- 
ing hyperplane is moved to find an optimal position 
in which the margin is maximized. 

• In standard LMNN we find the linear transformation 
that maximizes the sum of the local margins of all 
instances while keeping the neighboring instances of 
the same class close. 

• In the metric learning view of SVM we find a linear diagonal transformation $W$, $diag(W) = \mathbf{w}$, plus a translation $b$ that maximize the margin to the fixed hyperplane $H_1$, clamping the norm of $\mathbf{w}$ to 1, or minimize the norm of $\mathbf{w}$ while clamping the margin to 1.

• In e-SVM we find a linear diagonal transformation and a translation that maximize the margin with respect to the fixed hyperplane $H_1$ and keep all instances close to $H_1$.

• Finally in the interpretation of LMNN as a collection of local SVM-like classifiers we find for each instance $\mathbf{x}$ a linear transformation and a translation that clamp the margin to one and keep the target neighbors of $\mathbf{x}$ within a band the width of which is minimized. These transformations have common components as described in the previous section. LMNN is a set of local related e-SVMs in the quadratic space.

We will now give a high level picture of the relations of 
the different methods in the quadratic space $\Phi(\mathbf{x})$ by
showing how from one method we can move to another 
by adding or dropping constraints; the complete set of 
relations is given in Figure 4. 



We start with $n$ local non-related SVMs; if we add a constraint on the within-class distance to each of them we get $n$ non-related e-SVMs. If we add additional constraints that relate the different local e-SVMs, namely by constraining their quadratic components $\mathbf{w}^{quad}$ to be the same and PSD, and their linear component $\mathbf{w}_l^{lin}$ to be a function of $\mathbf{w}^{quad}$ and the center point $\mathbf{x}_l$, then we get LMNN.

If we go back to the original set of non-related SVMs
and add a constraint that forces all of them to have
the same $\mathbf{w}_l$ and $b'_l$ we get the standard global SVM.
Similarly if in the collection of the local e-SVMs we 
add a constraint that forces all of them to have the 
same optimal hyperplane then we also get the global 
e-SVM. In that case all the component classifiers of 
the ensemble reduce to one classifier and no voting is 
necessary. 

LMNN constrains the $n$ local e-SVMs to have the same quadratic component, $\mathbf{w}^{quad}$, makes the linear components dependent on $\mathbf{w}^{quad}$ and their respective center points, accommodating in this way the local information, and constrains the quadratic components $\mathbf{w}^{quad}$ to reflect the $L^T L$ PSD matrix, equation (12). On the other hand both the global SVM and e-SVM also constrain their quadratic components to be the same but remove the constraint on the PSD property of $\mathbf{w}^{quad}$; both constrain all their linear components to be the same, thus they do not incorporate local information. The last constraint is much more restrictive than the additional constraints of LMNN; as a result the global SVM and e-SVM models are much more constrained than that of LMNN. Although we demonstrate the relations only for the case of SVM and LMNN, they also hold for all other large margin based metric learning algorithms [13, 15, 14, 10].

e-SVM builds upon two popular learning biases, 
namely large margin learning and metric learning and 
exploits the strengths of both. In addition it finds 
theoretical support in the radius-margin SVM error 
bound. e-SVM can be seen as a bridge which connects 
SVM and existing large margin based metric learning 
algorithms, such as LMNN. 

6 Experiments and results 

In addition to studying in detail the relation between 
the different variants of SVM and metric learning we 
also examine the performance of the e-SVM algorithm. 
We experiment with e-SVM, eq. (10), and compare its
performance to that of a standard $l_1$ soft margin SVM,
and LMNN, with the following kernels: linear, polynomial
with degree 2 and 3, Gaussian with $\sigma = 1$. $C$ is
chosen with 10-fold inner Cross-Validation from the set
{0.1, 1, 10, 100, 1000}. For e-SVM we choose to set $\lambda$
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Table 1: Classification error. D gives the number of features of each dataset and N the number of examples. A bold 
entry in e-SVM or SVM indicates that the respective method had a lower error, similarly for an italics entry for the 
e-SVM and LMAWpair. 



Kernel 


Dataset 


SVM 


e-SVM 


LMNN 


Dataset 


SVM 


e-SVM 


LMNN 


linear 


N=62 


17.74 


17.74 


20.97 


N=198 


25.25 


21.21 


31.31 


poly2 


D=2000 


35.48 


35.48 


35.48 


D=34 


24.24 


23.74 


29.80 


poly3 




35.48 


24.19 


27.42 




20.20 


20.71 


31.31 


gauss 1 


colonTumor 


35.48 


35.48 


35.48 


wpbc 


23.74 


23.74 


23.74 


linear 


N=60 


40.00 


35.00 


45.00 


N=351 


7.98 


11.97 


8.26 


poly2 


D=7129 


35.00 


35.00 


35.00 


D=34 


5.98 


5.98 


7.12 


poly3 




36.67 


36.67 


33.33 




6.27 


6.27 


5.98 


gauss 1 


centralNS 


35.00 


35.00 


35.00 


ionosphere 


10.82 


5.41 


15.38 


linear 


N=134 


10.45 


6.72 


11.94 


N=569 


2.28 


3.69 


4.04 


poly2 


D=1524 


60.45 


53.73 


54.48 


D=30 


3.51 


4.75 


4.22 


poly3 




29.85 


22.39 


29.85 




2.64 


2.28 


3.34 


gauss 1 


femaleVsMale 


60.45 


60.45 


60.45 


wdbc 


16.34 


6.85 


11.25 


linear 


N=72 


1.39 


1.39 


1.39 


N=354 


31.01 


31.30 


36.23 


poly2 


D=7129 


34.72 


31.94 


33.33 


D=6 


30.72 


29.86 


36.23 


poly3 




34.72 


12.50 


33.33 




29.57 


30.43 


31.59 


gauss 1 


Leukemia 


34.72 


34.72 


34.72 


liver 


32.17 


32.46 


38.26 


linear 


N=208 


32.69 


23.08 


12.02 


N=476 


17.02 


15.76 


4.83 


poly2 


D=60 


19.23 


17.31 


13.94 


D=166 


6.30 


6.93 


3.99 


poly3 




13.46 


12.98 


12.50 




4.20 


4.83 


5.25 


gauss 1 


sonar 


42.79 


34.62 


42.79 


muskl 


43.49 


39.92 


43.07 



Table 2: McNemar score, loose: 0, win: 1, equal 0.5 



Kernel 


SVM 


e-SVM 


LMNN 


linear 


9 


10 


11 


poly2 


10 


10 


10 


poly3 


10 


11.5 


8.5 


gaussl 




14.5 




Total 


37.5 


46 


36.5 



to C /3 reflecting the fact that we tolerate less the mar- 
gin violations than a larger distance form the margin. 
We used ten datasets mainly from the UCI repository 
[6] . Attributes are standardized to have zero mean and 
one variance; kernels are normalized to have a trace of 
one. The number of target neighbors for LMNN is set 
to 5 and its 7 parameter is set to 0.5, following the 
default settings suggested in [15]. We estimated the 
classification error using 10-fold CV. The results are 
given in Table 1. Overall each algorithm is applied 
40 times (four kernels x ten datasets). Comparing e- 
SVM with SVM we see that the former has a lower 
error than SVM in 19 out of the 40 applications and a 
higher error in only nine. A similar picture appears in 
the e-SVM, LMNN, pair where the former has a lower 
error in 20 out of the 40 applications and a higher in 
only eight. If we break down the comparison per ker- 
nel type, we also see that, i.e. e-SVM has a systematic 
advantage over the other two algorithms no matter 
which kernel we use. 
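
The preprocessing used in this experimental setup (standardized attributes, trace-normalized kernels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the variance is the population variance, the polynomial kernel is assumed to be $(x^T z + 1)^d$ and the Gaussian kernel is assumed to be $\exp(-\|x - z\|^2 / (2\sigma^2))$; the text fixes only the degrees and $\sigma = 1$.

```python
import math

def standardize(X):
    """Scale each attribute (column) to zero mean and unit variance.
    Assumption: population variance; constant columns are left at zero."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(col) / n for col in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def poly_kernel(x, z, degree):
    # Assumed inhomogeneous form; the exact form is not stated in the text.
    return (dot(x, z) + 1.0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed parametrization of the Gaussian kernel with width sigma.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Kernel matrix K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(a, b) for b in X] for a in X]

def trace_normalize(K):
    """Rescale a kernel matrix so that its trace equals one."""
    trace = sum(K[i][i] for i in range(len(K)))
    return [[v / trace for v in row] for row in K]

if __name__ == "__main__":
    X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
    Z = standardize(X)
    K = trace_normalize(gram_matrix(Z, gaussian_kernel))
    print(sum(K[i][i] for i in range(len(K))))
```

Trace normalization puts the four kernels on a comparable scale, so that the same grid of $C$ values is meaningful for each of them.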

To examine the statistical significance of the above
results we use McNemar's test of significance, with a
significance level of 0.05. When comparing algorithms A
and B on a fixed dataset and a fixed kernel, algorithm A
is credited with one point if it was significantly better
than algorithm B, 0.5 points if there was no significant
difference between the two algorithms, and zero points
otherwise. In Table 2 we report the overall points that
each algorithm got, and in addition we break them down
over the different kernels. e-SVM has the highest score
with 46 points, followed by SVM with 37.5 and LMNN with
36.5. The advantage of e-SVM is much more pronounced in
the case of the Gaussian kernel. This could be attributed
to its additional regularization on the within-class
distance, which makes it more appropriate for very high
dimensional spaces.
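
The scoring scheme can be sketched as below. The text does not say which variant of McNemar's test was used, so this sketch assumes the common chi-square form with continuity correction and one degree of freedom; the function name and the input format (one boolean per test example, true when the prediction was correct) are hypothetical.

```python
# Chi-square critical value for 1 degree of freedom at significance 0.05.
CHI2_CRIT_005_DF1 = 3.841

def mcnemar_points(correct_a, correct_b):
    """Return (points_a, points_b) for one dataset/kernel comparison.

    correct_a[i] / correct_b[i] tell whether algorithm A / B classified
    example i correctly. Assumes the continuity-corrected statistic
    (|n10 - n01| - 1)^2 / (n10 + n01)."""
    # n10: A correct and B wrong; n01: A wrong and B correct.
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    if n10 + n01 == 0:
        return 0.5, 0.5            # identical behaviour: no difference
    stat = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    if stat <= CHI2_CRIT_005_DF1:
        return 0.5, 0.5            # difference is not significant
    return (1.0, 0.0) if n10 > n01 else (0.0, 1.0)

if __name__ == "__main__":
    a = [True] * 20 + [False] * 2 + [True] * 5
    b = [False] * 20 + [True] * 2 + [True] * 5
    print(mcnemar_points(a, b))
```

Every comparison distributes exactly one point between the two algorithms, so with ten datasets and three algorithm pairs each kernel contributes 30 points in total, which agrees with the linear row of Table 2 (9 + 10 + 11 = 30).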

7 Conclusion 

In this paper, we have shown how SVM learning can
be reformulated as a metric learning problem. Inspired
by this reformulation and the metric learning biases,
we proposed e-SVM, a new SVM-based algorithm in which,
in addition to the standard SVM constraints, we also
minimize a measure of the within-class distance. More
importantly, the metric learning view of SVM helped us
uncover a so far unknown connection between the two
seemingly very different learning paradigms of SVM and
LMNN. LMNN can be seen as a set of local SVM-like
classifiers in a quadratic space, and more precisely, as
a set of local e-SVM-like classifiers. Finally,
preliminary results show a superior performance of e-SVM
compared to both SVM and LMNN. Although our discussion
was limited to binary classification, it can be extended
to the multiclass case. Building on the SVM-LMNN
relation, our current work focuses on the full analysis
of the multiclass case, a new schema for multiclass SVM
which exploits the advantages of both LMNN and kNN in
multiclass problems, and finally the exploration of the
learning models which lie in between the SVM and LMNN
models.

Appendix

Showing the equivalence of problem (6) and 
problem (7) (Section 2): 

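With the 'symmetry' preference as described after
problem (6), the second constraint of problem (6) can
be replaced by the constraint
$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \gamma/2$.
Moreover, we can always replace the squared values
in problem (6) by their respective absolute values and
get:

$$
\begin{aligned}
\max_{\mathbf{w}} \ \gamma \quad \text{s.t.} \quad
& |\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j)| \geq \gamma, \quad \forall (i, j): y_i \neq y_j, \\
& y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \gamma/2, \quad \forall i.
\end{aligned}
$$

If the constraint
$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \gamma/2$ is
satisfied then $\mathbf{w}^T \mathbf{x}_i + b$ and $y_i$
have the same sign; therefore any two instances
$\mathbf{x}_i, \mathbf{x}_j$ which have different labels
($y_i \neq y_j$) will lie on opposite sides of the
hyperplane. Hence:

$$
\begin{aligned}
|\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j)|
&= |(\mathbf{w}^T \mathbf{x}_i + b) - (\mathbf{w}^T \mathbf{x}_j + b)| \\
&= |\mathbf{w}^T \mathbf{x}_i + b| + |\mathbf{w}^T \mathbf{x}_j + b| \\
&= y_i(\mathbf{w}^T \mathbf{x}_i + b) + y_j(\mathbf{w}^T \mathbf{x}_j + b) \\
&\geq \gamma/2 + \gamma/2 = \gamma
\end{aligned}
\tag{17}
$$

Therefore the first constraint of problem (6) is always
satisfied if the constraint
$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \gamma/2$ is
satisfied; thus (6) is equivalent to (7).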
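
The implication used in this argument can be checked numerically on a toy problem. The sketch below is only an illustration with hypothetical function names and made-up data; it verifies that when every instance satisfies the margin constraint with $\gamma/2$, every pair of instances with different labels satisfies the pairwise constraint with $\gamma$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_constraints_hold(w, b, X, y, gamma):
    """Check y_i (w^T x_i + b) >= gamma / 2 for every instance."""
    return all(yi * (dot(w, xi) + b) >= gamma / 2 for xi, yi in zip(X, y))

def pairwise_constraints_hold(w, X, y, gamma):
    """Check |w^T (x_i - x_j)| >= gamma for every pair with y_i != y_j."""
    for xi, yi in zip(X, y):
        for xj, yj in zip(X, y):
            if yi != yj:
                diff = [a - c for a, c in zip(xi, xj)]
                if abs(dot(w, diff)) < gamma:
                    return False
    return True

if __name__ == "__main__":
    # Toy data (made up for illustration): two points per class.
    X = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]]
    y = [1, 1, -1, -1]
    w, b, gamma = [1.0, 0.0], 0.0, 2.0
    print(margin_constraints_hold(w, b, X, y, gamma),
          pairwise_constraints_hold(w, X, y, gamma))
```

The pairwise check does not involve $b$ at all, since the offset cancels in $\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j)$; only the margin constraint ties the hyperplane's position to the data.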

Equivalence of (8) to standard SVM 
formulation 

We will show that problem (8) is equivalent to the
standard SVM. In fact (1) also scales with uniform scaling
of $\mathbf{w}, b$, and to avoid this problem
$\|\mathbf{w}\|\gamma$ is fixed to 1, which leads to the
second formula (2). We will show that the two ways of
avoiding the scaling problem, by forcing
$\|\mathbf{w}\|\gamma = 1$ or by forcing
$\|\mathbf{w}\| = 1$, are equivalent.

Indeed, another way to avoid the scaling problem is
to find a quantity which is invariant to the scaling of
$\mathbf{w}$. Let $\gamma = t\|\mathbf{w}\|_p$, hence
$t: 0 \to \infty$, and let us fix $\|\mathbf{w}\|_p = 1$.
Then let $P$ be the feasible set of $\mathbf{w}$ which
satisfies ($\|\mathbf{w}\|_p = 1$ and
$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 0, \forall i$), and
$Q = \{\mathbf{w} \mid \|\mathbf{w}\|_p = 1$ and
$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq t\|\mathbf{w}\|_p, \forall i\}$.
Notice that if $t = 0$ then $Q = P$, if $t > 0$ then
$Q \subset P$, and if $t > t_{max}$ then $Q$ will be empty.
For another value of $\|\mathbf{w}\|_p$,
$\|\mathbf{w}\|_p = \lambda$, the corresponding feasible
sets are $P_\lambda$ and $Q_\lambda$. There is a one to one
mapping from $P$ to $P_\lambda$, and from $Q$ to
$Q_\lambda$, and the value of $t$ beyond which $Q_\lambda$
is empty is the same as $t_{max}$. So $t_{max}$ is
invariant to the scaling of $\mathbf{w}$. Therefore (8) is
equivalent to:

$$
\max_{\mathbf{w}, b, t} \ t
\quad \text{s.t.} \quad
y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq t\|\mathbf{w}\|_p, \ \forall i,
\qquad t\|\mathbf{w}\|_p = 1
\tag{18}
$$

The value of the geometric margin here is fixed to
$1/\sqrt{d}$, while in the standard SVM the geometric margin
is $\gamma = 1/\|\mathbf{w}\|_2$. Using the $\ell_2$ norm of
$\mathbf{w}$, we get a formulation which is equivalent to
that of the hard margin SVM given in (2). Using the
$\ell_1$ norm of $\mathbf{w}$, we get the 1-norm SVM [17].
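
The scale invariance that this argument relies on can be illustrated with a small numeric check. The sketch below uses hypothetical names and made-up data; it computes the largest $t$ for which a given $(\mathbf{w}, b)$ satisfies $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq t\|\mathbf{w}\|_p$ for all $i$, and shows that this value does not change when $\mathbf{w}$ and $b$ are multiplied by the same positive constant.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def p_norm(w, p):
    """The l_p norm of w for p >= 1."""
    return sum(abs(v) ** p for v in w) ** (1.0 / p)

def largest_feasible_t(w, b, X, y, p=2):
    """Largest t with y_i (w^T x_i + b) >= t * ||w||_p for all i,
    i.e. the smallest functional margin divided by ||w||_p."""
    smallest_margin = min(yi * (dot(w, xi) + b) for xi, yi in zip(X, y))
    return smallest_margin / p_norm(w, p)

if __name__ == "__main__":
    # Toy data (made up for illustration).
    X = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]]
    y = [1, 1, -1, -1]
    w, b = [1.0, 0.5], 0.25
    for scale in (1.0, 3.0, 0.1):
        ws = [scale * v for v in w]
        print(scale, largest_feasible_t(ws, scale * b, X, y, p=2))
```

Because $t$ is unaffected by the scaling, the norm of $\mathbf{w}$ can be fixed to any positive value without changing the optimal $t$, which is the freedom that the choice between $\|\mathbf{w}\|_p = 1$ and $t\|\mathbf{w}\|_p = 1$ exploits.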



